ABSTRACT

Identifying a preferable route is an important problem that finds
applications in map services. When a user plans a trip within a city,
the user may want to find "a most popular route such that it passes
by shopping mall, restaurant, and pub, and the travel time to and
from his hotel is within 4 hours." However, none of the algorithms
in the existing work on route planning can be used to answer such
queries. Motivated by this, we define the problem of keyword-aware
optimal route query, denoted by KOR, which is to find an
optimal route such that it covers a set of user-specified keywords,
a specified budget constraint is satisfied, and an objective score of
the route is optimal. The problem of answering KOR queries is
NP-hard. We devise an approximation algorithm OSScaling with
provable approximation bounds. Based on this algorithm, another
more efficient approximation algorithm BucketBound is proposed.
We also design a greedy approximation algorithm. Results of
empirical studies show that all the proposed algorithms are capable of
answering KOR queries efficiently, while the BucketBound and
Greedy algorithms run faster. The empirical studies also offer
insight into the accuracy of the proposed algorithms.

1. INTRODUCTION 

Identifying a preferable route in a road network is an important 
problem that finds applications in map services. For example, map 
applications like Baidu Lvyou 1 and Yahoo Travel 2 offer tools for 
trip planning. However, the routes that they provide are collected 
from users and are thus pre-defined. This is a significant deficiency 
since there may not exist any pre-defined route that meets the user 
needs. The existing solutions (e.g., [16, 17, 22]) for trip planning 
or route search are often insufficient in offering the flexibility for 
users to specify their requirements on the route. 

Consider a user who wants to spend a day exploring a city. She 
is not familiar with the city and she might pose such a query: "Find 
the most popular route to and from my hotel such that it passes by 
shopping mall, restaurant, and pub, and the time spent on the road 
in total is within 4 hours." 

1 http://lvyou.baidu.com/
2 http://travel.yahoo.com
The example query above has two hard constraints: 1) the points
of interest preferred by the user, as expressed by a set of keywords
that should be covered in the route (e.g., "shopping mall",
"restaurant" and "pub"); 2) a budget constraint (e.g., travel time)
that should be satisfied by the route. The query aims to identify the
optimal route under the two hard constraints, such that an objective
score is optimized (e.g., route popularity [4]). Note that route
popularity can be estimated by the number of users traveling a route,
obtained from the user traveling histories recorded in sources such
as GPS trajectories or Flickr photos [4]. In general, the budget
constraint and the objective score can be of various types,
such as travel duration, distance, popularity, travel budget, etc. We
consider two different attributes for the budget constraint and the
objective score because users often need to balance the trade-off
between two aspects when planning their trips. For example, a popular
route may be quite expensive, or a route with the shortest length may
be of little interest. In the example query, it is likely that the most
popular route requires more than 4 hours of travel time. Hence, a route
search system should be able to balance such trade-offs according to
users' different preferences.

We refer to the aforementioned type of queries as keyword-aware
optimal route query, denoted as KOR. Formally, a KOR query is
defined over a graph $\mathcal{G}$, and the input to the query consists
of five parameters, $v_s$, $v_t$, $\psi$, $\Delta$, and $f$, where $v_s$ is
the source location of the route in $\mathcal{G}$, $v_t$ is the target
location, $\psi$ is a set of keywords, $\Delta$ is a budget limit, and $f$
is a function that calculates the objective score of a route. The query
returns a path $R$ in $\mathcal{G}$ starting at $v_s$ and ending at $v_t$,
such that $R$ minimizes $f(R)$ under the constraints that $R$ satisfies
the budget limit $\Delta$ and passes through locations that cover the
query keywords in $\psi$. To the best of our knowledge, none
of the existing work on trip planning or route search (e.g., [16, 17,
22]) is applicable for KOR queries. Furthermore, the problem of
solving KOR queries can be shown to be NP-hard by a reduction
from the weighted constrained shortest path problem [8]. It can
also be viewed as a generalized traveling salesman problem [11]
with constraints. This leads to an interesting question: is it possible
to derive efficient solutions to answering KOR queries?

Due to the hardness of answering KOR queries, in this paper, we
answer the aforementioned question affirmatively with three
approximation algorithms. The first approximation algorithm has a
performance bound and is denoted by OSScaling. In OSScaling,
we first scale the objective value of every edge to an integer by a
parameter $\varepsilon$ to obtain a scaled graph denoted by $\mathcal{G}_S$.
In the scaled graph $\mathcal{G}_S$, each partial route is represented by a
"label", which records the query keywords already covered by the partial
route, the scaled objective score, the original objective score, and
the budget score of the route. At each node, we maintain a list
of "useful" labels corresponding to the routes that go to that node.
Starting from the source node, we keep creating new partial routes
by extending the current "best" partial route to generate new labels,
until all the potentially useful labels on the target node are
generated. Finally, the route represented by the label with the best
objective score at the target node is returned.

We prove that the algorithm returns routes with objective scores
no worse than $\frac{1}{1-\varepsilon}$ times that of the optimal route.
The worst-case complexity of OSScaling is polynomial in
$\frac{1}{\varepsilon}$, the budget constraint $\Delta$, and the number of
edges and nodes in $\mathcal{G}$, and it is exponential in the number of
query keywords. That number is usually small in our targeted
applications: it is well known that search engine queries are short,
and an analysis of a large map query log [25] shows that nearly all
queries contain fewer than 5 words.

Our second algorithm, BucketBound, improves on OSScaling. It also
returns approximate solutions to KOR queries with performance
guarantees, but it is more efficient than OSScaling. The algorithm
can always return a route whose objective score is at most $\beta$
($\beta > 1$ is a parameter) times that of the one found by OSScaling.
The algorithm divides the traversed partial routes into different
"buckets" according to the best possible objective scores they can
achieve. This enables us to develop a novel way to detect if a feasible
route (covering all query keywords and satisfying the budget
constraint) is in the same bucket as the one found by OSScaling. When
we find a feasible route that falls in the same bucket as the route
found by OSScaling, we return it as the result.

Finally, we also present a greedy approach for the problem. From 
the starting location, we keep selecting the next location greedily, 
taking into account all the three constraints in the KOR query. This 
is repeated until we reach the target location. This algorithm is 
efficient, although it may generate a route that violates the two hard 
constraints of KOR: covering all query keywords and satisfying the 
budget constraint. 

In summary, our contributions are threefold. First, we propose 
the keyword-aware optimal route (KOR) query, and we show that 
the problem of solving KOR queries is NP-hard. Second, we present 
two novel approximation algorithms both with provable performance 
bounds for the KOR problem. We also provide a greedy approach. 
Third, we study the properties of the paper's proposals empirically 
on a graph extracted from a large collection of Flickr photos. The 
results demonstrate that the proposed solutions offer scalability and 
excellent performance. 

The rest of the paper is organized as follows: Section 2 formally 
defines the problem and establishes the computational complexities 
of the problem. Section 3 presents the proposed algorithms. We 
report on the empirical studies in Section 4. Finally, we cover the 
related work in Section 5 and offer conclusions in Section 6. 

2. PROBLEM STATEMENT 

We define the problem of the keyword-aware optimal route (KOR) 
query, and show the hardness of the problem. 

Definition 1: Graph. A graph $\mathcal{G} = (V, E)$ consists of a set of
nodes $V$ and a set of edges $E \subseteq V \times V$. Each node $v \in V$
represents a location associated with a set of keywords denoted by
$v.\psi$; each edge in $E$ represents a directed route between two
locations in $V$, and the edge from $v_i$ to $v_j$ is represented by
$(v_i, v_j)$. □

We define $\mathcal{G}$ as a general graph. It can be a road network graph,
or a graph extracted from users' historical trajectories. Depending
on the source of $\mathcal{G}$, each edge in $\mathcal{G}$ is associated with
different types of attributes. For example, if $\mathcal{G}$ is a traffic
network, the attributes can be travel duration, travel distance,
popularity, and travel cost. To keep our discussion simple, we consider
directed graphs only in this paper. However, our discussion can be
extended to undirected graphs straightforwardly.

Definition 2: Route. A route $R = (v_0, v_1, \ldots, v_n)$ is a path such
that $R$ goes through $v_0$ to $v_n$ sequentially, following the relevant
edges in $\mathcal{G}$. □

We define the optimal route based on two attributes on each edge
$(v_i, v_j)$: 1) one attribute is used as the objective value of this edge,
and it is denoted by $o(v_i, v_j)$ (e.g., the popularity), and 2) the other
attribute is used as the budget value of this edge, which is denoted
by $b(v_i, v_j)$ (e.g., the travel time). Note that we can pick any
two attributes to define the optimal route, depending on the
application.

Definition 3: Objective Score and Budget Score. Given a route
$R = (v_0, v_1, \ldots, v_n)$, the objective score of $R$ is defined as the sum
of the objective values of all the edges in $R$, i.e.,

$$OS(R) = \sum_{i=1}^{n} o(v_{i-1}, v_i),$$

and the budget score is defined as the sum of the budget values of
all the edges in $R$, i.e.,

$$BS(R) = \sum_{i=1}^{n} b(v_{i-1}, v_i).$$
Figure 1: Example of $\mathcal{G}$

Figure 1 shows an example of the graph $\mathcal{G}$. We consider only five
keywords ($t_1$-$t_5$), and each keyword is represented by a distinct
shape. For simplicity, each node contains a single keyword in the
example. On each edge, the score inside a bracket is the budget
value, and the other number is the objective value. For example,
given the route $R = (v_0, v_3, v_5, v_7)$, we have $OS(R) = 2 + 3 + 4 = 9$
and $BS(R) = 2 + 2 + 1 = 5$.
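
The two scores in Definition 3 can be sketched in a few lines of Python. The three edge values below are the ones quoted for the route $(v_0, v_3, v_5, v_7)$; assigning them to edges in the order of the summands is an assumption, as are the names `EDGES`, `objective_score`, and `budget_score`.

```python
# Edge attributes for the route (v0, v3, v5, v7): edge -> (objective value, budget value).
# Assumption: the summands quoted in the text are listed in edge order.
EDGES = {
    ("v0", "v3"): (2, 2),
    ("v3", "v5"): (3, 2),
    ("v5", "v7"): (4, 1),
}

def objective_score(route, edges):
    """OS(R): sum of objective values over consecutive node pairs."""
    return sum(edges[(u, v)][0] for u, v in zip(route, route[1:]))

def budget_score(route, edges):
    """BS(R): sum of budget values over consecutive node pairs."""
    return sum(edges[(u, v)][1] for u, v in zip(route, route[1:]))

route = ["v0", "v3", "v5", "v7"]
print(objective_score(route, EDGES), budget_score(route, EDGES))  # 9 5
```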

Intuitively, a keyword-aware optimal route (KOR) query is to 
find an optimal route from a source to a target in a graph such that 
the route covers all the query keywords, its budget score satisfies a 
given constraint, and its objective score is optimized. Formally, we 
define the KOR query as follows: 

Definition 4: Keyword-aware Optimal Route (KOR) Query.

Given $\mathcal{G}$, the keyword-aware optimal route query
$Q = \langle v_s, v_t, \psi, \Delta \rangle$, where $v_s$ is the source location,
$v_t$ is the target location, $\psi$ is a set of keywords, and $\Delta$
specifies the budget limit, aims to find the route $R$ starting from
$v_s$ and ending at $v_t$ (i.e., $(v_s, \ldots, v_t)$) such that

$$R = \arg\min_{R} OS(R) \quad \text{subject to} \quad \psi \subseteq \bigcup_{v \in R} v.\psi, \quad BS(R) \le \Delta.$$
In the example graph in Figure 1, given a query
$Q = \langle v_0, v_7, \{t_1, t_2, t_5\}, 8 \rangle$, the optimal route is
$R_{opt} = (v_0, v_3, v_4, v_7)$ with objective score $OS(R_{opt}) = 4$ and
budget score $BS(R_{opt}) = 7$. If we set $\Delta$ to 6, the optimal route
becomes $R_{opt} = (v_0, v_3, v_5, v_7)$ with $OS(R_{opt}) = 9$ and
$BS(R_{opt}) = 5$.

Theorem 1: The problem of solving KOR queries is NP-hard. 

Proof Sketch: This problem can be reduced from the NP-hard 
weight-constrained shortest path problem (WCSPP) [10]. Given a 
graph in which each edge has a length and a weight, WCSPP finds 
a path that has the shortest length with the total weight not exceed- 
ing a specified value. The problem of answering KOR queries is a 
generalization of WCSPP. If each node already covers all the query 
keywords, the problem of solving KOR becomes equivalent to the 
WCSPP. □ 

Obviously, if we disregard the query keyword constraint, the
problem of solving KOR becomes WCSPP. In addition, if we remove
the budget constraint, the problem becomes similar to the
generalized traveling salesman problem (GTSP) [11], which is also
NP-hard. In GTSP, the nodes of a graph are clustered into groups,
and GTSP finds a path starting and ending at two specified nodes
such that it goes through each group exactly once and has the
smallest length. In the problem of solving KOR, we can extract the
locations whose keywords overlap with $\psi$, and the locations that
cover the same keyword form a group. Thus, the problem of solving
KOR without the budget constraint is equivalent to the GTSP.
Furthermore, if we disregard the objective score, the problem of
finding a route that covers all the query keywords and satisfies the
budget constraint is still intractable. The simplified problem is also
equivalent to GTSP, and thus cannot be solved in polynomial time
unless P = NP. Many approaches have been proposed for solving GTSP
and WCSPP (e.g., [5, 7, 8, 23]). However, they cannot be applied to
answer the KOR queries since one more constraint or objective must be
satisfied in KOR compared with GTSP and WCSPP.

In the KOR problem, we consider two hard constraints, namely,
the keyword coverage and the budget limit, and aim to minimize
the objective score. The simplified versions that consider any two
aspects are also NP-hard, as analyzed above. Hence, it is challenging
to find an efficient solution to answering KOR queries. If a route
satisfies the two hard constraints, the route is called a feasible
solution or a feasible route.

Furthermore, we can extend the KOR query to the keyword-aware
top-k route (KkR) query. Instead of finding the optimal route
defined in KOR, the KkR query is to return k routes starting and
ending at the given locations such that they have the smallest
objective scores, cover the query keywords, and satisfy the given budget
constraint.

3. ALGORITHMS 

We present the pre-processing method in Section 3.1, the pro- 
posed approximation algorithm OSScaling with provable approx- 
imation bound in Section 3.2, the more efficient approximation al- 
gorithm BucketBound also with performance guarantee in Sec- 
tion 3.3, and the greedy algorithm Greedy in Section 3.4. 

3.1 Pre-processing 

We introduce the pre-processing method. We utilize the pre- 
processing results to accelerate the algorithms to be proposed. 

We use the Floyd-Warshall algorithm [9], which is a well-known
algorithm for finding all-pairs shortest paths, to find the following
two paths for each pair of nodes $(v_i, v_j)$:

• $\tau_{i,j}$: the path with the smallest objective score. The objective
score of this path is denoted by $OS(\tau_{i,j})$ and the budget score
is denoted by $BS(\tau_{i,j})$.

• $\sigma_{i,j}$: the path with the smallest budget score. The objective
score of $\sigma_{i,j}$ is denoted by $OS(\sigma_{i,j})$ and the budget score is
denoted by $BS(\sigma_{i,j})$.

For example, after the pre-processing, for the pair of nodes $(v_0, v_7)$
in Figure 1, we have $\tau_{0,7} = (v_0, v_3, v_4, v_7)$ with $OS(\tau_{0,7}) = 4$ and
$BS(\tau_{0,7}) = 7$, and $\sigma_{0,7} = (v_0, v_3, v_5, v_7)$ with
$OS(\sigma_{0,7}) = 9$ and $BS(\sigma_{0,7}) = 5$.

Only the objective and budget scores of $\tau_{i,j}$ and $\sigma_{i,j}$ are used in
the proposed algorithms, while the two paths themselves are not.
The space cost is $O(|V|^2)$, where $|V|$ represents the number of
nodes in the graph. In general, the number of points of interest
$|V|$ within a city is not large [15, 19].
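
The pre-processing step can be sketched as two Floyd-Warshall passes over tuple-valued weights, one ordered by objective value and one by budget value. This is a minimal sketch, not the authors' implementation: breaking ties lexicographically on the second attribute is an assumption, and the values of the edges $(v_3, v_4)$ and $(v_4, v_7)$ are hypothetical, chosen only so that the path totals match those quoted above.

```python
import math

INF = (math.inf, math.inf)

def all_pairs(nodes, weights):
    """Floyd-Warshall over 2-tuples compared lexicographically."""
    dist = {(u, v): INF for u in nodes for v in nodes}
    for u in nodes:
        dist[(u, u)] = (0, 0)
    for edge, w in weights.items():
        dist[edge] = min(dist[edge], w)
    for k in nodes:
        for i in nodes:
            for j in nodes:
                a, b = dist[(i, k)], dist[(k, j)]
                via = (a[0] + b[0], a[1] + b[1])
                if via < dist[(i, j)]:
                    dist[(i, j)] = via
    return dist

def preprocess(nodes, edges):
    """Return tau and sigma, each mapping a node pair to (OS, BS)."""
    tau = all_pairs(nodes, edges)  # minimise the objective first
    by_budget = all_pairs(nodes, {e: (b, o) for e, (o, b) in edges.items()})
    sigma = {pair: (o, b) for pair, (b, o) in by_budget.items()}
    return tau, sigma

NODES = ["v0", "v3", "v4", "v5", "v7"]
EDGES = {  # edge -> (objective value, budget value)
    ("v0", "v3"): (2, 2), ("v3", "v5"): (3, 2), ("v5", "v7"): (4, 1),
    ("v3", "v4"): (1, 2), ("v4", "v7"): (1, 3),  # hypothetical values
}
tau, sigma = preprocess(NODES, EDGES)
print(tau[("v0", "v7")], sigma[("v0", "v7")])  # (4, 7) (9, 5)
```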

We use an inverted file to organize the word information of nodes.
An inverted file index has two main components: 1) a vocabulary
of all distinct words appearing in the descriptions of nodes
(locations), and 2) a posting list for each word $t$, that is, a sequence
of identifiers of the nodes whose descriptions contain $t$. We use a
disk-resident B+-tree for the inverted file index.

3.2 Approximation Algorithm OSScaling 

A brute-force approach to solving KOR is to do an exhaustive 
search: We enumerate all candidate paths from the source node. 
We can use a queue to store the partial paths. In each step, we 
select one partial path from the queue. Then it is extended to gen- 
erate more candidate partial paths and those paths whose budget 
scores are smaller than the specified limit are enqueued. When a 
path is extended to the target node, we check whether it covers all 
the query keywords and satisfies the budget constraint. We record 
all the feasible routes, and after all the candidate routes from the 
source node to the target node have been checked, we select the 
best one of all the feasible routes as the answer to the query. 
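
The exhaustive search described above can be sketched as follows. The toy graph, its keywords, and the function name `brute_force_kor` are hypothetical and are not taken from Figure 1; the sketch also assumes strictly positive budget values, which is what makes the search terminate.

```python
from collections import deque

def brute_force_kor(edges, keywords, source, target, query, limit):
    """Enumerate all budget-feasible paths; return (OS, route) of the best one, or None."""
    adj = {}
    for (u, v), (o, b) in edges.items():
        adj.setdefault(u, []).append((v, o, b))
    best = None
    queue = deque([([source], 0, 0)])  # (partial path, objective score, budget score)
    while queue:
        path, os_, bs = queue.popleft()
        if path[-1] == target:
            covered = set().union(*(keywords[v] for v in path))
            if query <= covered and (best is None or os_ < best[0]):
                best = (os_, path)
        for v, o, b in adj.get(path[-1], []):
            if bs + b <= limit:  # only budget-feasible extensions are enqueued
                queue.append((path + [v], os_ + o, bs + b))
    return best

# Hypothetical toy graph: edge -> (objective value, budget value).
EDGES = {("a", "b"): (1, 3), ("a", "c"): (4, 1), ("b", "d"): (1, 3),
         ("c", "d"): (4, 1), ("b", "c"): (1, 1)}
KEYWORDS = {"a": set(), "b": {"mall"}, "c": {"pub"}, "d": set()}

print(brute_force_kor(EDGES, KEYWORDS, "a", "d", {"mall", "pub"}, 5))
# (6, ['a', 'b', 'c', 'd'])
```

With the budget limit lowered to 4, no route covers both keywords and the function returns `None`.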

However, the exhaustive search is computationally prohibitive.
Given a query with a specified budget limit $\Delta$, we know that the
number of edges in a route explored in the search is at most
$\lfloor \frac{\Delta}{b_{min}} \rfloor$, where $b_{min}$ is the smallest budget
value of all edges in $\mathcal{G}$. Thus, the complexity of an exhaustive
search is $O(d^{\lfloor \Delta / b_{min} \rfloor})$, where $d$ is the
maximum outdegree in $\mathcal{G}$ (notice that enumerating all the simple
paths is not enough for answering KOR queries). To avoid the
expensive exhaustive search, we devise a novel approximation
algorithm OSScaling. It is challenging to develop such an algorithm.

The main problem of the brute-force approach is that too many
partial paths need to be stored on each node. In order to reduce
the cost of enumerating the partial paths, in OSScaling, we scale
the objective values of edges in $\mathcal{G}$ into integers utilizing a
parameter $\varepsilon$. The scaling enables us to bound the number of
partial paths explored, and further to design a novel algorithm that
runs polynomially in the budget constraint $\Delta$,
$\frac{1}{\varepsilon}$, and the number of nodes and edges in
$\mathcal{G}$, and is exponential in the number of query keywords
(which is typically small). Furthermore, the objective score scaling
guarantees that the algorithm always returns a route whose objective
score is no more than $\frac{1}{1-\varepsilon}$ times that of the optimal
route, if there exists one. This is inspired by the FPTAS (fully
polynomial-time approximation scheme) for solving the well-known
knapsack problem [24]. Note that the problem of answering KOR queries
is different from the NP-hard knapsack problem and its solutions
cannot be used.

We define a scaling factor $\theta = \frac{\varepsilon \, o_{min} b_{min}}{\Delta}$,
where $o_{min}$ and $b_{min}$ represent the smallest objective value and
the smallest budget value of all edges in $\mathcal{G}$, respectively, and
$\varepsilon$ is a parameter in the range (0, 1). Next, for each edge
$(v_i, v_j)$, we scale its objective value $o(v_i, v_j)$ to
$\hat{o}(v_i, v_j) = \lfloor \frac{o(v_i, v_j)}{\theta} \rfloor$. We call the
graph with scaled objective values the scaled graph, denoted by
$\mathcal{G}_S$. Given a route $R = (v_0, v_1, \ldots, v_n)$ in
$\mathcal{G}_S$, we denote its scaled objective score by
$\widehat{OS}(R) = \sum_{i=1}^{n} \hat{o}(v_{i-1}, v_i)$.
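
The scaling step can be sketched as below, assuming the scaling factor is $\varepsilon \, o_{min} b_{min} / \Delta$ and each objective value is divided by it and rounded down. Exact rational arithmetic is used here to avoid floating-point rounding around the floor; that choice, the edge values marked hypothetical, and the function name `scale_objectives` are assumptions of this sketch.

```python
import math
from fractions import Fraction

def scale_objectives(edges, eps, limit):
    """Return (theta, scaled) where scaled[e] = floor(o(e) / theta)."""
    o_min = min(o for o, _ in edges.values())
    b_min = min(b for _, b in edges.values())
    theta = Fraction(eps) * o_min * b_min / limit
    scaled = {e: math.floor(Fraction(o) / theta) for e, (o, _) in edges.items()}
    return theta, scaled

EDGES = {  # edge -> (objective value, budget value)
    ("v0", "v3"): (2, 2), ("v3", "v5"): (3, 2), ("v5", "v7"): (4, 1),
    ("v3", "v4"): (1, 2), ("v4", "v7"): (1, 3),  # hypothetical values
}
theta, scaled = scale_objectives(EDGES, eps="0.5", limit=10)
print(theta, scaled[("v0", "v3")])  # 1/20 40
```

With $\varepsilon = 0.5$, a budget limit of 10, and smallest objective and budget values of 1, every objective value is multiplied by 20.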

On the scaled graph, we still extend from the source node to create
new partial paths until we reach the target node. However, if
a partial path is no better than another partial path on the same node
in both scaled objective score and budget score, and covers no
additional query keywords, the OSScaling algorithm ignores it. Before
detailing the algorithm, we introduce the following important
definitions.

Definition 5: Node Label. For each node $v_i$, we maintain a list
of labels, in which each label corresponds to a path $P_i^k$ from the
source node $v_s$ to node $v_i$. The label is denoted by $L_i^k$ and is in
the format $(\lambda, \widehat{OS}, OS, BS)$, where $L_i^k.\lambda$ is the set
of keywords covered by $P_i^k$, and $L_i^k.\widehat{OS}$, $L_i^k.OS$, and
$L_i^k.BS$ represent the scaled objective score, the original objective
score, and the budget score of $P_i^k$, respectively. □

Example 1: In the example graph shown in Figure 1, assuming
$\Delta = 10$ and $\varepsilon = 0.5$, we can compute the value of $\theta$:
$\theta = \frac{0.5 \cdot o_{min} \cdot b_{min}}{10} = \frac{1}{20}$.
Therefore, the objective value of each edge is scaled to 20
times its original value. Consider the two paths from $v_0$ to $v_4$, i.e.,
$R_1 = (v_0, v_2, v_3, v_4)$ and $R_2 = (v_0, v_2, v_6, v_5, v_4)$. The label of
$R_1$ is $L_4^0 = (\langle t_1, t_2, t_4 \rangle, 100, 5, 7)$ and the label of
$R_2$ is $L_4^1 = (\langle t_1, t_2, t_4 \rangle, 120, 6, 11)$. □

Each partial route is represented by a node label. At each node,
we maintain a list of labels, each of which stores the information
of a corresponding partial route from the source node to this node,
including the query keywords already covered, the scaled objective
score, the original objective score, and the budget score of the
partial route. Many paths between two nodes may exist, and thus each
node may be associated with a large number of labels. However,
most of the labels are not necessary for answering KOR. Considering
Example 1, at node $v_4$, the label $L_4^1$ can be ignored since $L_4^0$
has both smaller objective and budget scores. This is because, in
any route extended from $L_4^1$, we can always replace the partial
route corresponding to $L_4^1$ with that corresponding to label $L_4^0$.
We say that $L_4^0$ dominates $L_4^1$:

Definition 6: Label Domination. Let $L_i^k$ and $L_i^l$ be two labels
corresponding to two different paths from the source node $v_s$ to
node $v_i$. We say $L_i^k$ dominates $L_i^l$ iff
$L_i^k.\lambda \supseteq L_i^l.\lambda$,
$L_i^k.\widehat{OS} \le L_i^l.\widehat{OS}$, and $L_i^k.BS \le L_i^l.BS$. □
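
A label and the domination test can be sketched as a small data class. The field names and the use of non-strict comparisons are assumptions of this sketch; the two labels are the ones given in Example 1.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Label:
    covered: frozenset  # keywords covered by the partial route
    scaled_os: int      # scaled objective score
    orig_os: int        # original objective score
    bs: int             # budget score

def dominates(a: Label, b: Label) -> bool:
    """a dominates b: a covers at least b's keywords and is no worse on both scores."""
    return a.covered >= b.covered and a.scaled_os <= b.scaled_os and a.bs <= b.bs

kw = frozenset({"t1", "t2", "t4"})
first = Label(kw, 100, 5, 7)     # label of (v0, v2, v3, v4)
second = Label(kw, 120, 6, 11)   # label of (v0, v2, v6, v5, v4)
print(dominates(first, second), dominates(second, first))  # True False
```

Note that the comparison uses the scaled objective score only; the original objective score is carried along but plays no role in the test.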

Notice that in OSScaling we determine if a label dominates another
one with regard to the scaled objective score instead of the
original objective score. Therefore, a dominated label may have a
smaller original objective score, and hence the optimal
route may be missed by this algorithm. This is the reason that
OSScaling can only return approximate results. However, by doing
so, the maximum number of labels on a node is bounded, which
further bounds the complexity of OSScaling. We have the following
lemma:

Lemma 1: On a node there are at most
$2^m \lfloor \frac{\Delta}{b_{min}} \rfloor \lfloor \frac{o_{max} \Delta}{\varepsilon \, o_{min} b_{min}} \rfloor$
labels, where $m$ is the number of query keywords, $\varepsilon$ is the
scaling parameter, and $b_{min}$, $o_{max}$, and $o_{min}$ represent the
smallest budget value, the largest objective value, and the smallest
objective value of all edges in $\mathcal{G}$, respectively.

Proof Sketch: First, given $m$ query keywords, there are at most
$2^m$ keyword subsets. Second, given the budget limit $\Delta$, the
number of edges in a route checked by our algorithm does not exceed
$\lfloor \frac{\Delta}{b_{min}} \rfloor$. Hence, the scaled objective score of
a route in $\mathcal{G}_S$ is bounded by
$\lfloor \frac{\Delta}{b_{min}} \rfloor \lfloor \frac{o_{max} \Delta}{\varepsilon \, o_{min} b_{min}} \rfloor$.
In conclusion, we only need to store at most
$2^m \lfloor \frac{\Delta}{b_{min}} \rfloor \lfloor \frac{o_{max} \Delta}{\varepsilon \, o_{min} b_{min}} \rfloor$
labels, because all the rest can be dominated by them. □

Note that Lemma 1 gives an upper bound of the label number at
a node. In practice, the number of labels maintained at a node is
usually much smaller than this upper bound. We denote this upper
bound by $L_{max}$.
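
As a rough illustration of how the bound in Lemma 1 grows, the helper below evaluates it, assuming the bound is the product of the number of keyword subsets, the maximum number of edges in a route, and the largest scaled objective value of an edge. The input values in the usage line, including the largest objective value of 4, are hypothetical.

```python
import math
from fractions import Fraction

def max_labels_per_node(m, limit, eps, b_min, o_min, o_max):
    """Evaluate 2^m * floor(limit / b_min) * floor(o_max * limit / (eps * o_min * b_min))."""
    edge_bound = math.floor(Fraction(limit) / b_min)
    score_bound = math.floor(Fraction(o_max) * limit / (Fraction(eps) * o_min * b_min))
    return 2 ** m * edge_bound * score_bound

print(max_labels_per_node(m=2, limit=10, eps="0.5", b_min=1, o_min=1, o_max=4))  # 3200
```

Halving the scaling parameter doubles the last factor, which is the source of the trade-off between running time and accuracy.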

Next, we introduce how to do the route extension using labels. 
This step is called label treatment: 

Definition 7: Label Treatment. Given a label $L_i^k$ at node $v_i$, for
each outgoing neighbor $v_j$ of node $v_i$ in $\mathcal{G}$, we create a new
label for $v_j$:
$L_j^l = (L_i^k.\lambda \cup v_j.\psi,\ L_i^k.\widehat{OS} + \hat{o}(v_i, v_j),\ L_i^k.OS + o(v_i, v_j),\ L_i^k.BS + b(v_i, v_j))$. □

The label treatment step extends a partial route at node $v_i$
forward to all the outgoing neighbor nodes of $v_i$, and thus longer
partial routes are generated. Note that the label treatment step is
applied together with label domination checking.
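
Label treatment for a single outgoing edge can be sketched as below. Labels are plain tuples in the order (covered keywords, scaled objective score, original objective score, budget score). Restricting the covered set to query keywords is an assumption of this sketch, as is the helper name `treat_label`.

```python
import math
from fractions import Fraction

def treat_label(label, neighbor_keywords, query, o, b, theta):
    """Extend a label across one edge with objective value o and budget value b."""
    covered, scaled_os, orig_os, bs = label
    return (
        covered | (frozenset(neighbor_keywords) & frozenset(query)),
        scaled_os + math.floor(Fraction(o) / theta),
        orig_os + o,
        bs + b,
    )

start = (frozenset(), 0, 0, 0)
print(treat_label(start, {"t2"}, {"t1", "t2"}, o=2, b=2, theta=Fraction(1, 20)))
```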

Another important definition is how we compare the order of two
labels:

Definition 8: Label Order. Let $L_i^k$ and $L_j^l$ be two labels
corresponding to two paths from source node $v_s$ to node $v_i$ and $v_j$
($v_i$ and $v_j$ can be either the same or different nodes), respectively.
We say $L_i^k$ has a lower order than $L_j^l$, denoted by
$L_i^k \prec L_j^l$, iff $|L_i^k.\lambda| > |L_j^l.\lambda|$, or
($|L_i^k.\lambda| = |L_j^l.\lambda|$ and
$L_i^k.\widehat{OS} < L_j^l.\widehat{OS}$), or
($|L_i^k.\lambda| = |L_j^l.\lambda|$,
$L_i^k.\widehat{OS} = L_j^l.\widehat{OS}$, and $L_i^k.BS < L_j^l.BS$);
otherwise, the tie is broken by the alphabetical order of $v_i$ and $v_j$. □

In Example 1, we say that $L_4^0 \prec L_4^1$, because they contain the
same number of query keywords, and $L_4^0$ has smaller objective and
budget scores. This definition decides which partial route is
selected for extension in each step.
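
The label order maps naturally onto a sort key, so that a min-priority queue returns the label of lowest order first. This is a sketch under the assumption that labels are tuples in the order (covered keywords, scaled objective score, original objective score, budget score); `order_key` is a hypothetical helper name.

```python
def order_key(node, label):
    """Sort key: more covered keywords first, then scaled score, budget score, node name."""
    covered, scaled_os, _orig_os, bs = label
    return (-len(covered), scaled_os, bs, node)

kw = frozenset({"t1", "t2", "t4"})
first = (kw, 100, 5, 7)    # label of (v0, v2, v3, v4) in Example 1
second = (kw, 120, 6, 11)  # label of (v0, v2, v6, v5, v4) in Example 1
fewer = (frozenset({"t1"}), 10, 1, 1)

ranked = sorted([("v4", second), ("v1", fewer), ("v4", first)],
                key=lambda item: order_key(*item))
print([label[1] for _, label in ranked])  # [100, 120, 10]
```

The label covering fewer keywords comes last even though both of its scores are smaller, since the number of covered keywords is compared first.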

Now we are ready to present our algorithms. The basic idea is 
to keep creating new partial routes from the best one among all ex- 
isting partial routes. From the viewpoint of node labels, we first 
create a label at the source node, and then we keep generating new 
labels that cannot be dominated by existing ones. We always select 
the one with the smallest order according to Definition 8 to gener- 
ate new labels. If newly generated labels cannot be dominated by 
existing labels, they are used to detect and delete the labels domi- 
nated by them. We repeat this procedure until all the labels on the 
target node are generated, and finally the label with the best objec- 
tive score satisfying the budget limit at the target node is returned. 
Note that this is not an exhaustive search algorithm and we will 
analyze the complexity after presenting the algorithm. 

The pseudocode is presented in Algorithm 1. We use a min-priority
queue Q to organize the labels, which are enqueued into Q
according to their orders defined in Definition 8. We use variable
$U$ to keep track of the upper bound of the objective score, and use
$LL$ to store the last label of the current best route. We initialize $U$
as $\infty$, and set $LL$ as NULL. We create a label at the starting node
$v_s$ and enqueue it into Q (lines 2-4).

We keep dequeuing labels from Q until Q becomes empty (lines
5-20). We terminate the algorithm when Q is empty or when all the
labels in Q have objective scores larger than $U$. In each while-loop,
we first dequeue a label $L_i^k$ with the minimum label order from Q
(line 6). If the objective score of $L_i^k$ plus the best objective score
$OS(\tau_{i,t})$ from $v_i$ to the target node $v_t$ is larger than the current
upper bound $U$, then the label definitely cannot contribute to the final
result (line 7). Next, for each outgoing neighbor $v_j$ of $v_i$, we create
a new label $L_j^l$ for it according to Definition 7 (line 9). If $L_j^l$ can be
dominated by other labels on the node $v_j$ or if it cannot generate a
Algorithm 1: OSScaling Algorithm

 1 Initialize a min-priority queue Q;
 2 $U \leftarrow \infty$; $LL \leftarrow$ NULL;
 3 At node $v_s$, create a label: $L_s^0 \leftarrow (v_s.\psi, 0, 0, 0)$;
 4 Q.enqueue($L_s^0$);
 5 while Q is not empty do
 6     $L_i^k \leftarrow$ Q.dequeue();
 7     if $L_i^k.OS + OS(\tau_{i,t}) > U$ then continue;
 8     for each edge $(v_i, v_j)$ do
 9         Create a label $L_j^l$ for $v_j$: $L_j^l \leftarrow (L_i^k.\lambda \cup v_j.\psi,\ L_i^k.\widehat{OS} + \hat{o}(v_i, v_j),\ L_i^k.OS + o(v_i, v_j),\ L_i^k.BS + b(v_i, v_j))$;
10         if $L_j^l$ is not dominated by other labels on $v_j$ and $L_j^l.BS + BS(\sigma_{j,t}) \le \Delta$ and $L_j^l.OS + OS(\tau_{j,t}) \le U$ then
11             if $L_j^l$ does not cover all the query keywords then
12                 Q.enqueue($L_j^l$);
13                 foreach label $L$ on $v_j$ do
14                     if $L$ is dominated by $L_j^l$ then
15                         remove $L$ from Q;
16             else
17                 if $L_j^l.BS + BS(\tau_{j,t}) \le \Delta$ then
18                     $U \leftarrow L_j^l.OS + OS(\tau_{j,t})$;
19                     $LL \leftarrow L_j^l$;
20                 else Q.enqueue($L_j^l$);
21 if $U$ is $\infty$ then return "No feasible route exists";
22 else Obtain the route utilizing $LL$ and return it;



Example 2: Consider the example graph in Figure 1, the query
$Q = \langle v_0, v_7, \{t_1, t_2\}, 10 \rangle$, and $\varepsilon$ set to 0.5. The
steps of the algorithm are shown in Figure 2 and the contents of the
labels generated are in Table 1.

Initially, we create a label $L_0^0 = (\emptyset, 0, 0, 0)$ at node $v_0$ and
enqueue it into Q. After we dequeue it from Q, as shown in step (a), we
generate three labels, one on each outgoing neighbor of $v_0$. The
three labels are also enqueued into Q.

In the next loop, the label on $v_2$ is selected because it has the
lowest order of the three. As shown in step (b), we generate another
two labels, one on $v_3$ and one on $v_6$. Note that the best budget score
from $v_6$ to $v_7$ is 7 ($BS(\sigma_{6,7}) = 7$), and thus the label on $v_6$
can be ignored since its budget score plus $BS(\sigma_{6,7})$ (= 11)
exceeds $\Delta$. According to the pre-processing results,
$OS(\tau_{3,7}) = 2$ and $BS(\tau_{3,7}) = 5$. Therefore, in step (c), we get a
feasible route $R_1 = (v_0, v_2, v_3, v_4, v_7)$ with $OS(R_1) = 6$ and
$BS(R_1) = 10$. The upper bound $U$ is updated to $OS(R_1)$, i.e., $U = 6$.

Next, the label on node $v_3$ that was created in step (a) is selected.
As shown in step (d), we generate another three labels on the outgoing
neighbors of $v_3$ and enqueue them into Q. Now the label on $v_5$ already
covers all the query keywords. According to the pre-processing results,
from $v_5$ to $v_7$, the best objective score is 3 ($OS(\tau_{5,7}) = 3$) and
the budget score of this path is 4. Utilizing the pre-processing
results, as shown in step (e), we can obtain another feasible solution
$R_2 = (v_0, v_3, v_5, v_4, v_7)$ with $OS(R_2) = 8$ and
$BS(R_2)$ (= 8) $\le \Delta$. (Note that if $\Delta = 7$ in $Q$, $R_2$ is not a
feasible result. Instead, we enqueue the label on $v_5$ into Q, and in the
next loop, we include the edge $(v_5, v_7)$ and get a feasible route
$(v_0, v_3, v_5, v_7)$.)

The remaining labels are treated similarly, and the best route is □



feasible route (first, the budget score of $L_j^l$ plus $BS(\sigma_{j,t})$, the
best budget score to $v_t$, is larger than the budget constraint $\Delta$;
second, the objective score of $L_j^l$ plus $OS(\tau_{j,t})$, the best
objective score to $v_t$, is larger than the current upper bound $U$), we
ignore the new label (line 10); otherwise, if it does not cover all the
query keywords, we enqueue it into Q and use it to detect and delete the
labels that are dominated by it on $v_j$ (lines 11-15).

When we find that the current label $L_j^l$ already covers all the
query keywords, a feasible solution is found and we update the
upper bound $U$ (lines 16-20). First, if the budget score of $L_j^l$ plus
the budget score of $\tau_{j,t}$ (the path with the best objective score
from $v_j$ to $v_t$) does not exceed the budget limit $\Delta$, we update the
upper bound $U$, and the last label is also updated (lines 18-19);
otherwise, we enqueue this label into Q for later processing. Finally,
if $U$ is never updated, we know that there exists no feasible route for
the given query; otherwise, we can construct the route using the label
$LL$ (lines 21-22).
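
The procedure above can be sketched end to end in Python. This is a simplified illustration under stated assumptions, not the authors' implementation: labels carry their partial path, dominated labels are dropped lazily when dequeued, dominated-label removal runs on every enqueue, ties in the pre-processing are broken lexicographically, and the toy graph and all function names are hypothetical.

```python
import heapq
import math
from fractions import Fraction
from itertools import count

def floyd(nodes, weights):
    """All-pairs lexicographic shortest paths over 2-tuples, with next hops."""
    dist = {(u, v): (math.inf, math.inf) for u in nodes for v in nodes}
    nxt = {}
    for u in nodes:
        dist[(u, u)] = (0, 0)
    for (u, v), w in weights.items():
        if w < dist[(u, v)]:
            dist[(u, v)], nxt[(u, v)] = w, v
    for k in nodes:
        for i in nodes:
            for j in nodes:
                a, b = dist[(i, k)], dist[(k, j)]
                via = (a[0] + b[0], a[1] + b[1])
                if via < dist[(i, j)]:
                    dist[(i, j)], nxt[(i, j)] = via, nxt[(i, k)]
    return dist, nxt

def dominates(a, b):
    # Labels are (covered, scaled OS, OS, BS, path).
    return a[0] >= b[0] and a[1] <= b[1] and a[3] <= b[3]

def os_scaling(nodes, edges, keywords, source, target, query, limit, eps):
    query = frozenset(query)
    tau, hop = floyd(nodes, edges)  # tau[pair] = (OS, BS)
    sigma, _ = floyd(nodes, {e: (b, o) for e, (o, b) in edges.items()})  # (BS, OS)
    theta = (Fraction(eps) * min(o for o, _ in edges.values())
             * min(b for _, b in edges.values()) / limit)
    adj = {}
    for (u, v), (o, b) in edges.items():
        adj.setdefault(u, []).append((v, o, b))
    tick, removed, heap = count(), set(), []
    labels = {v: [] for v in nodes}  # (id, label) pairs kept for domination checks

    def enqueue(v, lab):
        tid = next(tick)
        removed.update(i for i, old in labels[v] if dominates(lab, old))
        labels[v] = [(i, l) for i, l in labels[v] if i not in removed] + [(tid, lab)]
        heapq.heappush(heap, (-len(lab[0]), lab[1], lab[3], v, tid, lab))

    enqueue(source, (frozenset(keywords[source]) & query, 0, 0, 0, (source,)))
    upper, best = math.inf, None
    while heap:
        _, _, _, u, tid, (cov, sos, os_, bs, path) = heapq.heappop(heap)
        if tid in removed or os_ + tau[(u, target)][0] > upper:
            continue
        for v, o, b in adj.get(u, []):
            new = (cov | (frozenset(keywords[v]) & query),
                   sos + math.floor(Fraction(o) / theta), os_ + o, bs + b, path + (v,))
            if (any(dominates(old, new) for _, old in labels[v])
                    or new[3] + sigma[(v, target)][0] > limit
                    or new[2] + tau[(v, target)][0] > upper):
                continue
            if new[0] == query and new[3] + tau[(v, target)][1] <= limit:
                upper, best = new[2] + tau[(v, target)][0], new
            else:
                enqueue(v, new)
    if best is None:
        return None
    route, node = list(best[4]), best[4][-1]
    while node != target:  # append the pre-computed min-objective suffix
        node = hop[(node, target)]
        route.append(node)
    return upper, route

NODES = ["a", "b", "c", "d"]
EDGES = {("a", "b"): (1, 3), ("a", "c"): (4, 1), ("b", "d"): (1, 3),
         ("c", "d"): (4, 1), ("b", "c"): (1, 1)}  # edge -> (objective, budget)
KEYWORDS = {"a": set(), "b": {"mall"}, "c": {"pub"}, "d": set()}
print(os_scaling(NODES, EDGES, KEYWORDS, "a", "d", {"mall", "pub"}, 5, "0.5"))
```

On this toy graph the only route that covers both keywords within a budget of 5 is a, b, c, d, so the sketch returns it with objective score 6; with a budget of 4 it returns `None`.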

The following example illustrates how this algorithm works. 
Table 1: Label contents

Complexity: In each loop of OSScaling, we dequeue one label
from Q. Thus, in the worst case we need $|V| L_{max}$ loops according
to Lemma 1. Within one loop, 1) we generate new labels on a
node and check the domination on its outgoing neighbors, taking
$O(|E| L_{max})$ time over all loops by aggregate analysis; 2) we dequeue
one label and the complexity is $O(\lg L_{max})$. Hence, we can conclude
that the worst-case time complexity is
$O(|V| L_{max} \lg L_{max} + |E| L_{max})$. In practice, the number of loops
is much smaller than in the worst case and the number of keywords of a
query is quite small. Therefore, the algorithm OSScaling is able to
return the result efficiently.

By scaling the objective values of edges in $\mathcal{G}$, the algorithm
OSScaling is able to guarantee an approximation bound.
Approximation Bound: We denote the route found by OSScaling
as $R_{OS}$, and the feasible route with the smallest scaled objective
score in $\mathcal{G}_S$ as $R_{\widehat{OS}}$. We have the following lemma:

Lemma 2: $OS(R_{\widehat{OS}}) \ge OS(R_{OS})$.

Proof Sketch: In Algorithm 1, if we use the partial route with the
smallest scaled objective score to update the upper bound at node
$v_j$ (line 18), the algorithm returns $R_{\widehat{OS}}$. We denote the
objective score of a route from $v_p$ to $v_q$ as $O_{p,q}$, and we know
$O_{s,j}(R_{\widehat{OS}}) = O_{s,j}(R_{OS})$. According to the algorithm,
$O_{j,t}(R_{\widehat{OS}}) \ge OS(\tau_{j,t}) = O_{j,t}(R_{OS})$, and thus
$OS(R_{\widehat{OS}}) = O_{s,j}(R_{\widehat{OS}}) + O_{j,t}(R_{\widehat{OS}}) \ge O_{s,j}(R_{OS}) + O_{j,t}(R_{OS}) = OS(R_{OS})$. □

We denote the optimal route as $R_{opt}$. We have:
Theorem 2: $OS(R_{opt}) \ge (1 - \varepsilon) OS(R_{OS})$.
Because $\sum_{e \in R_{\widehat{OS}}} o_e \ge o_{min}$, we can conclude that
$OS(R_{opt}) \ge (1 - \varepsilon) OS(R_{\widehat{OS}}) \ge (1 - \varepsilon) OS(R_{OS})$
(according to Lemma 2). □

We can see that the parameter $\epsilon$ affects not only the running time
of this algorithm but also the accuracy. There is a tradeoff between
efficiency and accuracy when selecting a value for $\epsilon$. With a
larger value of $\epsilon$, OSScaling runs faster but the accuracy
drops; conversely, with a smaller value of $\epsilon$ we can obtain
better routes at the cost of a longer query time.
Optimization: We design the following optimization strategies to
further improve Algorithm 1.

Optimization Strategy 1: When processing a label $L_i^k$ at node
$v_i$, in addition to the labels generated by following the outgoing
edges of $v_i$ in the graph, we also generate a label on a node $v_j$ such
that $BS(\sigma_{i,j})$ has the smallest value among all the nodes containing
an uncovered query keyword and $L_i^k.BS + BS(\sigma_{i,j}) + BS(\sigma_{j,t}) \leq \Delta$.
The motivation of this strategy is to find a feasible solution as
early as possible, which is then used to update the upper bound and
thus to prune more labels.

Optimization Strategy 2: When the query contains some very
infrequent words, we can utilize the nodes that contain them to find
the result more efficiently. In Algorithm 1, when we decide if a
label $L_i^k$ can be deleted, two specific conditions are checked: 1)
if $L_i^k.OS + OS(\tau_{i,t})$ is smaller than U; 2) if $L_i^k.BS + BS(\sigma_{i,t})$
is smaller than $\Delta$. We utilize the scores of the two pre-processed
routes from $v_i$ to the target node $v_t$. But if the paths from $v_i$ to the
nodes containing the infrequent words have large objective or budget
scores, we will waste a lot of time on extending the route from
$v_i$. The reason is that, although the label $L_i^k$ cannot be pruned by
the two conditions, it cannot generate useful labels, and this is not
known until we reach the nodes containing the infrequent words.
We first obtain all the nodes containing the least frequent word
(whose frequency must be below a threshold, such as appearing
in less than 1% of the nodes) utilizing the inverted file; after we
generate a label $L_i^k$, if it does not cover the least frequent word, for
each such node $l$, we check two conditions: 1) $L_i^k.OS + OS(\tau_{i,l}) + OS(\tau_{l,t}) > U$;
2) $L_i^k.BS + BS(\sigma_{i,l}) + BS(\sigma_{l,t}) > \Delta$. If on each
node containing the infrequent word at least one condition is satisfied,
this label can be discarded.

3.3 Approximation Algorithm BucketBound 

In the algorithm OSScaling, after we find a feasible solution, 
we still have to keep searching for a better route until all the fea- 
sible routes are checked. We propose a more efficient approximate 
method denoted by BucketBound with provable approximation 
bounds which is also based on scaling the objective scores into in- 
tegers. 

Before describing the proposed algorithm, we introduce the fol- 
lowing lemma which lays a foundation of this algorithm. 

Lemma 3: Given a label $L_i^k$ at node $v_i$, the best possible objective
score of the feasible routes that could be extended from the partial
path represented by $L_i^k$ is $L_i^k.OS + OS(\tau_{i,t})$. We denote the score
by $LOW(L_i^k)$.



Proof Sketch: If $\tau_{i,t}$ and $L_i^k$ cover all query keywords collectively,
they constitute the best route extending from $L_i^k$ and its objective
score is equal to $L_i^k.OS + OS(\tau_{i,t})$. Otherwise, another route from
$v_i$ to $v_t$ covering more keywords must be selected to construct a
feasible route. This route has a larger objective score than that of
$\tau_{i,t}$, which results in a larger objective score of the final route. □

In this algorithm, we divide the traversed partial routes into dif- 
ferent "buckets" according to their best possible objective scores. 
We define the buckets as follows: 

Definition 9: Label Buckets. The label buckets organize labels.
Each bucket is associated with an order number and corresponds to
an objective score interval: the $r$th bucket $B_r$ corresponds to the
interval $[\beta^r OS(\tau_{s,t}), \beta^{r+1} OS(\tau_{s,t}))$, where $OS(\tau_{s,t})$
is the best objective score from $v_s$ to $v_t$ and $\beta$ is a specified
parameter. A label $L_i^k$ is in the bucket $B_r$ if:

$$\beta^r OS(\tau_{s,t}) \leq LOW(L_i^k) < \beta^{r+1} OS(\tau_{s,t})$$
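As a small illustration of Definition 9 (a sketch, not part of the paper's listings), the order number of the bucket that a label falls into can be obtained with a logarithm. The function name `bucket_index` and its arguments are hypothetical, and the sketch assumes $\beta > 1$ and $LOW \geq OS(\tau_{s,t}) > 0$.

```python
import math


def bucket_index(low: float, best_os: float, beta: float) -> int:
    """Return r such that beta**r * best_os <= low < beta**(r + 1) * best_os.

    low     -- LOW score of the label (hypothetical argument name)
    best_os -- OS(tau_{s,t}), the best objective score from source to target
    beta    -- bucket growth parameter, assumed to be > 1
    """
    if beta <= 1.0 or best_os <= 0.0 or low < best_os:
        raise ValueError("requires beta > 1 and low >= best_os > 0")
    r = int(math.floor(math.log(low / best_os, beta)))
    # The logarithm can be off by one near a bucket boundary because of
    # floating-point rounding, so nudge r until the interval really holds.
    while beta ** (r + 1) * best_os <= low:
        r += 1
    while r > 0 and beta ** r * best_os > low:
        r -= 1
    return r
```

For instance, with $\beta = 1.2$ and $OS(\tau_{s,t}) = 1$, a label with LOW score 1.3 lies in $[1.2, 1.44)$ and therefore belongs to $B_1$.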



With this important definition, we proceed to present the approximation
algorithm BucketBound. We denote the route found by
OSScaling as $R_{OS}$. The basic idea is as follows: We keep selecting
labels (partial routes) from the buckets. When selecting a label,
we always choose the non-empty bucket with the smallest order
number, and then select a label with the lowest label order from it.
After a label $L_i^k$ is generated, we compute the score $LOW(L_i^k)$ and
we place this label in the corresponding bucket according to
Definition 9. Utilizing the label buckets gives us a way
to detect if a feasible route found is in the same bucket as $R_{OS}$. If
we find such a route during the above procedure, we return it as the
result. We denote the route found by BucketBound as $R_{BB}$.

We proceed to explain how to determine if the bucket where we
find a feasible route contains $R_{OS}$.

This algorithm follows the basic label generation and selection 
approach in OSScaling. However, the strategies of generating and 
selecting labels are different. With such changed label generation 
and selection strategies, we have the following lemma: 

Lemma 4: If all the buckets $B_i$ ($i = 0, \ldots, r$) are empty and no
feasible solution is found yet, the objective score of $R_{OS}$ satisfies:
$OS(R_{OS}) \geq \beta^{r+1} OS(\tau_{s,t})$.

Proof Sketch: Since every bucket $B_i$ ($i \leq r$) is empty, we know the
label corresponding to $R_{OS}$ must be selected from the subsequent
buckets. Therefore, its LOW score is at least $\beta^{r+1} OS(\tau_{s,t})$. According to
Lemma 3, $OS(R_{OS})$ is no smaller than this LOW score, and hence
$OS(R_{OS}) \geq \beta^{r+1} OS(\tau_{s,t})$. □

Based on Lemma 4, we have Lemma 5. When the condition in
Lemma 5 is satisfied, a feasible route and $R_{OS}$ fall into the same
bucket, and the algorithm terminates.

Lemma 5: When a feasible route $R_{BB}$ is found in the bucket $B_{r+1}$
and all the buckets $B_0, B_1, \ldots, B_r$ are empty, the route $R_{OS}$ found
by OSScaling is also contained in $B_{r+1}$.

Proof Sketch: Because every bucket $B_i$ ($i \leq r$) is empty, according
to Lemma 4, $OS(R_{OS}) \geq \beta^{r+1} OS(\tau_{s,t})$. Since $OS(R_{OS}) \leq OS(R_{BB})$
($R_{BB}$ is one feasible solution found in OSScaling), we
know $\beta^{r+1} OS(\tau_{s,t}) \leq OS(R_{OS}) \leq OS(R_{BB}) < \beta^{r+2} OS(\tau_{s,t})$.
According to Definition 9, $R_{OS}$ also falls in $B_{r+1}$. □

Figure 3 illustrates the basic process of the proposed approximation
algorithm BucketBound. As shown in the figure, we first
select a label from the bucket $B_0$, and after the label treatment
the new label is put into the bucket $B_3$. Since $B_0$ becomes empty
now, we proceed to select labels from $B_1$. If $B_0$, $B_1$, and $B_2$ all
become empty, according to Lemma 4 we know $OS(R_{OS}) \geq \beta^3 OS(\tau_{s,t})$.
If now we find a feasible route $R_{BB}$ in the bucket $B_3$,
according to Lemma 5, it is assured that $R_{OS}$ also falls into $B_3$,
and we return $R_{BB}$ as the result.

Figure 3: Process of Algorithm 2

Unlike Algorithm 1, the approximation algorithm terminates
immediately when Lemma 5 is satisfied, which means a feasible
solution is found. Note that the feasible solution may be different
from the first feasible solution found by Algorithm 1. This algorithm
is also capable of determining if a feasible route exists. If all
buckets are empty during the label selection step and no feasible
route has been found yet, there exists no result for KOR. This is because,
when all buckets are empty, none of the labels generated satisfies
the budget constraint, which means that all the partial routes
generated from the source node exceed the budget limit $\Delta$.



Algorithm 2: BucketBound Algorithm

 1 Initialize a min-priority queue $B_0$;
 2 LL ← NULL; Found ← false;
 3 At node $v_s$, create label $L_s^0$ ← $(v_s.\psi, 0, 0, 0)$;
 4 $B_0$.enqueue($L_s^0$);
 5 while Found is false do
 6     $B_r$ ← the queue of the first non-empty bucket;
 7     if all queues are empty then return "No feasible route exists";
 8     $L_i^k$ ← $B_r$.dequeue();
 9     for each edge $(v_i, v_j)$ do
10         Create a new label $L_j^l$ for $v_j$: $L_j^l$ ← $(L_i^k.\lambda \cup v_j.\psi,\ L_i^k.\hat{OS} + \hat{o}(v_i, v_j),\ L_i^k.OS + o(v_i, v_j),\ L_i^k.BS + b(v_i, v_j))$;
11         if $L_j^l$ is not dominated by other labels on $v_j$ and $L_j^l.BS + BS(\sigma_{j,t}) \leq \Delta$ then
12             Find $B_s$ that $L_j^l$ falls into;
13             if $B_s$ does not exist then
14                 Initialize a priority queue $B_s$;
15             $B_s$.enqueue($L_j^l$);
16             for each label $L$ on $v_j$ do
17                 if $L$ is dominated by $L_j^l$ then
18                     remove $L$ from the corresponding queue;
19             if $L_j^l$ covers all the query keywords then
20                 if $B_r$ and $B_s$ are the same queue then
21                     if $L_j^l.BS + BS(\tau_{j,t}) \leq \Delta$ then
22                         Found ← true; // Lemma 5
23                         LL ← $L_j^l$;
24 Obtain the route utilizing LL and return the route;

The algorithm is detailed in Algorithm 2. It uses a min-priority
queue for each bucket to organize the labels in the bucket. We
initialize the first min-priority queue $B_0$ (corresponding to the first
bucket with boundary $[OS(\tau_{s,t}), \beta OS(\tau_{s,t}))$); U and LL are
initialized as in Algorithm 1. We initialize the flag Found as false,
which records if a feasible route is found. We create a label at the
source node $v_s$ and enqueue it into $B_0$ (lines 1-4). The algorithm
terminates when the flag Found is true. We keep dequeuing labels
from $B_r$, which represents the non-empty bucket with the smallest
order number, until we find a solution or no result exists (lines 5-23).
If all queues become empty, it is assured that no feasible route
exists (line 7). After we select a label $L_i^k$ on node $v_i$, for each
outgoing neighbor $v_j$ of $v_i$, we create a new label for it (line 10). When
a new label $L_j^l$ is generated, we check: 1) if it is dominated
by other labels on $v_j$; 2) if it definitely cannot lead to a feasible result. If
so, we ignore it (line 11); otherwise, we use it to delete labels on
$v_j$ that are dominated by it, and we enqueue this label into the
corresponding bucket according to its best possible objective score
(lines 12-18). When $L_j^l$ already covers all the query keywords and
also falls into $B_r$, we still need to test if the path corresponding to
$LOW(L_j^l)$ satisfies the budget constraint. If so, we find a solution
and exit the loop according to Lemma 5 (lines 19-23).
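The bucket bookkeeping used by Algorithm 2 (lines 6-8 and 12-15) can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the class `BucketStore`, its method names, and the use of Python's `heapq` are hypothetical, and a label is reduced to an opaque payload with a numeric ordering key.

```python
import heapq
from typing import Any, Dict, List, Optional, Tuple


class BucketStore:
    """One min-priority queue per bucket order number (hypothetical helper)."""

    def __init__(self) -> None:
        self._queues: Dict[int, List[Tuple[float, int, Any]]] = {}
        self._counter = 0  # tie-breaker so that payloads are never compared

    def push(self, bucket: int, key: float, label: Any) -> None:
        # Lazily create the queue B_s if it does not exist yet, then enqueue.
        queue = self._queues.setdefault(bucket, [])
        heapq.heappush(queue, (key, self._counter, label))
        self._counter += 1

    def pop_lowest(self) -> Optional[Tuple[int, Any]]:
        """Dequeue from the non-empty bucket with the smallest order number.

        Returns (bucket order number, label), or None when every queue is
        empty, which corresponds to "no feasible route exists".
        """
        nonempty = [r for r, queue in self._queues.items() if queue]
        if not nonempty:
            return None
        r = min(nonempty)
        _, _, label = heapq.heappop(self._queues[r])
        return r, label
```

A caller would compare the order number returned by `pop_lowest` with the bucket of each newly generated label that covers all keywords; when the two coincide and the budget test passes, the search can stop, as Lemma 5 states.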

Theorem 3: Algorithm 2 offers the approximation ratio $\frac{\beta}{1-\epsilon}$.

Proof Sketch: Assume that the solution $R_{BB}$ is found in $B_k$. According
to Lemma 5, the route $R_{OS}$ found by OSScaling is also contained
in $B_k$. Thus, we have $OS(R_{OS}) \geq \beta^k OS(\tau_{s,t})$ and
$OS(R_{BB}) < \beta^{k+1} OS(\tau_{s,t})$. According to Theorem 2, we can get:

$$\frac{OS(R_{BB})}{OS(R_{opt})} = \frac{OS(R_{BB})}{OS(R_{OS})} \cdot \frac{OS(R_{OS})}{OS(R_{opt})} < \frac{\beta^{k+1} OS(\tau_{s,t})}{\beta^{k} OS(\tau_{s,t}) (1-\epsilon)} = \frac{\beta}{1-\epsilon}$$ □

Although BucketBound has the same worst-case complexity as Algorithm 1,
it processes far fewer labels and is more efficient in practice.
Note that the two optimization strategies in OSScaling are still
applicable in BucketBound.

3.4 Greedy Algorithm 

We propose an approximation algorithm using the greedy ap- 
proach to solve KOR. It has no performance guarantee. 

There are three constraints in the KOR problem: a) a set of
keywords must be covered; b) the objective score must be minimized;
c) the budget limit $\Delta$ must be satisfied. As discussed in
Section 2, by considering only two of them, the problem is still
NP-hard. Therefore, a greedy approach normally cannot guarantee
that two constraints are satisfied. Since the keyword and budget
constraints are hard constraints, we design a greedy algorithm such
that it is able to find a route either covering all the query keywords
or satisfying the budget constraint, while minimizing the objective
score greedily.

The idea is that we start from the source node, and keep selecting
the next best node according to a certain strategy until we finally
reach the target node. The strategy of selecting the next node affects
the results significantly. We design a greedy strategy that takes
into account all the three constraints simultaneously to find the best
next node: a) the node contains uncovered query keywords; and
b) the best route that can be generated after including this node into
the current partial route is expected to have a small objective score
and fulfill the budget constraint. We use a parameter $\alpha$ to balance
the importance of the objective and budget scores when selecting
a node: at node $v_i$, when we extend the current partial route $R_i$
ending at $v_i$, we select the node $v_j$ that minimizes the following
score:



$$score(v_j, R_i) = \alpha \, (R_i.OS + OS(\tau_{i,j}) + OS(\tau_{j,t})) + (1 - \alpha) \, (R_i.BS + BS(\tau_{i,j}) + BS(\sigma_{j,t})) \qquad (1)$$



When $\alpha = 0$, we select a node only based on the budget score, i.e.,
selecting the node such that the budget score of the corresponding
partial route plus the best budget score from the node to the target
node $v_t$ is the smallest. When $\alpha = 1$, the algorithm finds a node
such that the objective score of the corresponding partial route plus
the best objective score from the node to $v_t$ is minimized.
Algorithm 3: Greedy Algorithm

 1 nodeSet ← $\emptyset$; wordSet ← $Q.\psi \setminus v_s.\psi$;
 2 $v_{pre}$ ← $v_s$; OS ← 0; BS ← 0;
 3 for each word $w_i \in$ wordSet do
 4     Get the location set lSet containing $w_i$;
 5     nodeSet ← nodeSet $\cup$ lSet;
 6 while wordSet is not empty do
 7     $v_m$ ← $\arg\min_{v_m \in nodeSet} score(v_m, R_{pre})$;
 8     OS ← OS + $OS(\tau_{pre,m})$; BS ← BS + $BS(\tau_{pre,m})$;
 9     $v_{pre}$ ← $v_m$;
10     wordSet ← wordSet $\setminus\ v_m.\psi$;
11     Remove the locations containing $v_m.\psi$ from nodeSet;
12 OS ← OS + $OS(\tau_{pre,t})$; BS ← BS + $BS(\tau_{pre,t})$;
13 Return the route found with scores OS and BS;



The pseudocode is outlined in Algorithm 3. We use wordSet to
keep track of the uncovered query keywords and nodeSet to store
all the locations containing uncovered query keywords (line 1).
$v_{pre}$ denotes the node where the current partial path ends and is
initialized as $v_s$. OS and BS are used to store the objective and
budget scores and are both initialized to 0 (line 2). We utilize an inverted
file to find locations for nodeSet (lines 3-5). The algorithm terminates
when wordSet is empty. While it is not empty, we find the
best node according to Equation 1 (line 7), extend the partial route
(lines 8-9), and update wordSet and nodeSet (lines 10-11). After
we exit the loop, we add the last segment from the partial route's
last node to the target node $v_t$ to construct the final route and return
(lines 12-13).
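The steps of Algorithm 3 can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the authors' implementation: the function `greedy_route` and its parameter names are hypothetical; `os_tau[(i, j)]` and `bs_tau[(i, j)]` are assumed to hold the precomputed objective and budget scores of $\tau_{i,j}$, and `bs_sigma[(j, t)]` the budget score of $\sigma_{j,t}$. Equation 1 is read here as $\alpha$ weighting the objective terms and $1-\alpha$ weighting the budget terms.

```python
from typing import Dict, List, Set, Tuple

Pair = Tuple[str, str]


def greedy_route(source: str, target: str, query: Set[str],
                 keywords: Dict[str, Set[str]],
                 os_tau: Dict[Pair, float], bs_tau: Dict[Pair, float],
                 bs_sigma: Dict[Pair, float],
                 alpha: float = 0.5) -> Tuple[List[str], float, float]:
    """Greedy-1 sketch: returns (visited nodes, objective score, budget score)."""
    word_set = set(query) - keywords[source]          # uncovered keywords
    node_set = {v for v, kw in keywords.items() if kw & word_set}
    pre, os_total, bs_total = source, 0.0, 0.0
    visited = [source]
    while word_set:
        if not node_set:
            raise ValueError("some query keyword appears on no node")

        def score(m: str) -> float:
            # Assumed reading of Equation 1 for candidate node m.
            return (alpha * (os_total + os_tau[(pre, m)] + os_tau[(m, target)])
                    + (1 - alpha) * (bs_total + bs_tau[(pre, m)]
                                     + bs_sigma[(m, target)]))

        best = min(sorted(node_set), key=score)       # sorted: stable ties
        os_total += os_tau[(pre, best)]
        bs_total += bs_tau[(pre, best)]
        pre = best
        visited.append(best)
        word_set -= keywords[best]
        # One reading of line 11: keep only nodes that still contribute
        # at least one uncovered keyword.
        node_set = {v for v in node_set if keywords[v] & word_set}
    os_total += os_tau[(pre, target)]                 # final segment to target
    bs_total += bs_tau[(pre, target)]
    visited.append(target)
    return visited, os_total, bs_total
```

The returned budget score is not checked against $\Delta$ here, so a caller would compare it with the budget limit to decide whether the route is feasible.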

Algorithm 3 may fail to find a feasible route even if a feasible
one exists. In each step, it selects the next best node. If
we select more nodes at each step, the accuracy will be better while
the search space becomes much larger. Hence, there is a tradeoff
between accuracy and efficiency. In the experiments, we study the
performance of Algorithm 3 when the best 2 nodes are selected at
each step. We denote the algorithm selecting one node by Greedy-1,
and the algorithm selecting two nodes by Greedy-2. The worst-case
time complexity of Greedy-1 is $O(mn)$ and for Greedy-2 it is
$O(2^m n)$, where $m$ is the number of query keywords and $n$ is the
number of nodes in the graph.

Algorithm 3 guarantees that the query keywords are always covered
while the budget limit may not be satisfied. This is desirable
when the query keywords are important to users (e.g., the users do
not want to miss any type of locations in their plan). However, if
the budget score is very important (e.g., the users cannot overrun
their money budget), we modify this algorithm slightly to accommodate
the need. We return a route with budget score not exceeding
$\Delta$ while the query keywords may not be totally covered. We
break the while-loop when the current partial route cannot be
extended any more. That is, in line 6 in Algorithm 3, we check if
$L.BS + BS(\sigma_{i,t}) > \Delta$ instead of if wordSet is empty.

3.5 Keyword-aware Top-k Optimal Route Search

We further extend the KOR query to the keyword-aware top-k
route (KkR) query. Instead of finding the optimal route defined
in KOR, the KkR query is to return the top-k routes starting and
ending at the given locations such that they have the best objective
scores, cover all the query keywords, and satisfy the given budget
constraint. We introduce how to modify the OSScaling algorithm
and the BucketBound algorithm for solving KkR approximately.

It is relatively straightforward to extend the two approximation
algorithms OSScaling and BucketBound for processing the KkR
query. Due to space limitations, we only briefly present the
extension. We need to introduce the definition of "k-dominate". A label
is "k-dominated" if at least k labels dominate it. In the pseudocode
of the OSScaling algorithm, we need to replace "dominate" by
"k-dominate." Moreover, instead of keeping track of only the current
best result, we need to track the current best k results. The budget
score of the kth best route is used as the upper bound U to prune
unnecessary labels. Similarly, in the BucketBound algorithm, we
also apply "k-dominate". Moreover, instead of returning immediately
when we find a feasible route in the bucket containing $R_{OS}$,
the algorithm terminates when we find k feasible routes from the
non-empty bucket with the smallest order number.

Note that we do not extend the greedy algorithm for solving
KkR. The greedy approach is not able to guarantee that a feasible
route can be found. Therefore, it is meaningless to return k
routes using such a method.
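The "k-dominate" test can be illustrated with a short sketch. The definition of domination lies outside this section, so the sketch assumes, as a labeled assumption, that one label dominates another when it covers a superset of the keywords with a scaled objective score and a budget score that are both no larger. The names `Label`, `dominates`, and `is_k_dominated` are hypothetical.

```python
from typing import FrozenSet, Iterable, NamedTuple


class Label(NamedTuple):
    covered: FrozenSet[str]   # query keywords covered so far
    scaled_os: int            # scaled objective score
    bs: float                 # budget score


def dominates(a: Label, b: Label) -> bool:
    # Assumed notion of domination: a is at least as good as b in
    # keyword coverage, scaled objective score, and budget score.
    return (a.covered >= b.covered
            and a.scaled_os <= b.scaled_os
            and a.bs <= b.bs)


def is_k_dominated(label: Label, others: Iterable[Label], k: int) -> bool:
    """Return True if at least k labels in `others` dominate `label`."""
    if k < 1:
        raise ValueError("k must be at least 1")
    count = 0
    for other in others:
        if other is not label and dominates(other, label):
            count += 1
            if count >= k:
                return True
    return False
```

With k = 1 this reduces to the ordinary domination check, which matches the way the text derives the top-k variant from the single-route algorithms.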

4. EXPERIMENTAL STUDY 
4.1 Experimental Settings 

Algorithms. We study the performance of the following proposed 
algorithms: the approximation algorithm OSScaling in Section 3.2, 
the approximation algorithm BucketBound in Section 3.3, and 
the greedy algorithms in Section 3.4, denoted by Greedy-1 and 
Greedy-2 corresponding to selecting the top-1 and top-2 best loca- 
tions, respectively. 

Additionally, we also implemented a naive brute-force approach 
discussed in Section 3.2. However, it is at least 2 orders of magni- 
tude slower than OSScaling and cannot finish after 1 day, and thus 
is omitted. 

Data and queries. We use five datasets in our experimental study.
The first one is a real-life dataset collected from Flickr 3 using its
public API. We collected 1,501,553 geo-tagged photos taken by
30,664 unique users in the region of New York City in the United
States. Each photo is associated with a set of user-annotated tags.
The latitude and the longitude of the place where the photo was taken
and the time at which it was taken are also collected. Following the work [15], we
utilize a clustering method to group the photos into locations. We
associate each location with tags obtained by aggregating the tags
of all photos in that location after removing the noisy tags, such
as tags contributed by only one user. Finally, we obtain 5,199
locations and 9,785 tags in total. Each location is associated with a
number of photos taken in the location. Next, we sort the photos
from the same user according to the time at which they were taken. If two consecutive
photos are taken at two different places and the time gap between them
is less than 1 day, we consider that the user made a trip between the
two locations, and we build an edge between them.

On each edge, the Euclidean distance between its two vertices
(locations) serves as the budget value. We compute a popularity
score for each edge following the idea of the work [4]. The
popularity of an edge $(v_i, v_j)$ is estimated as the probability of the
edge being visited: $Pr_{i,j} = \frac{Num(v_i, v_j)}{TotalTrips}$, where $Num(v_i, v_j)$ is
the number of trips between $v_i$ and $v_j$ and $TotalTrips$ is the total
number of trips. The total popularity score of a route
$R = (v_0, v_1, \ldots, v_n)$ is computed as: $PS(R) = \prod_{i=1}^{n} Pr_{i-1,i}$.
However, the popularity score should be maximized. To transform the
maximization problem to the minimization problem as defined in
KOR, we compute the objective score on each edge $(v_i, v_j)$ as:
$o(v_i, v_j) = \log(\frac{1}{Pr_{i,j}})$. Therefore, if $OS(R)$ is minimized, $PS(R)$
is maximized.
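The effect of this transformation can be checked numerically. The sketch below assumes the per-edge objective score is $\log(1/Pr_{i,j})$ and that a route's objective score is the sum over its edges, so that $OS(R) = -\log PS(R)$; the function names and the probabilities are made up for illustration.

```python
import math
from typing import Sequence


def popularity_score(edge_probs: Sequence[float]) -> float:
    """PS(R): product of the visiting probabilities of the route's edges."""
    return math.prod(edge_probs)


def objective_score(edge_probs: Sequence[float]) -> float:
    """OS(R): sum of log(1 / Pr) over the route's edges (assumed form)."""
    return sum(math.log(1.0 / p) for p in edge_probs)


# Two hypothetical routes, each described by its edge probabilities.
route_a = [0.5, 0.4]
route_b = [0.2, 0.9]

# The more popular route has the smaller objective score, and OS = -log(PS).
more_popular_is_cheaper = (
    popularity_score(route_a) > popularity_score(route_b)
    and objective_score(route_a) < objective_score(route_b)
)
```

Because the logarithm is monotone, ranking routes by increasing objective score is the same as ranking them by decreasing popularity score.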



http://www.flickr.com/ 
The other 4 datasets are generated from real data, mainly for
scalability experiment. By extracting the subgraph of the New York 
road network 4 , we obtain 4 datasets containing 5,000, 10,000, 
15,000, and 20,000 nodes, respectively. Each node is associated 
with a set of randomly selected tags from the real Flickr dataset. 
The travel distance is used as the budget score, and we randomly 
generate the objective score in the range (0,1) on each edge to cre- 
ate the graphs for the four datasets. 

We generate 5 query sets for the Flickr dataset, in which the number
of keywords is 2, 4, 6, 8, and 10, respectively. The starting
and ending locations are selected randomly. Each set comprises 50 
queries. Similarly, we also generate 5 query sets for each of the 4 
other datasets. 

All algorithms were implemented in VC++ and run on an In- 
tel(R) Xeon(R) CPU X5650 @2.66GHz with 4GB RAM. 

4.2 Experimental Results 

4.2.1 Efficiency of Different Algorithms 

The objective of this set of experiments is to study the efficiency
of the proposed algorithms with variation of the number of query
keywords and the budget limit $\Delta$ (travel distance). We set the value
for the scaling parameter $\epsilon$ in OSScaling and BucketBound at 0.5,
the specified parameter $\beta$ at 1.2 for BucketBound, and the default
value for $\alpha$ in Greedy at 0.5. We also study
the runtime when varying the value of $\epsilon$ for OSScaling, and
the runtime when varying the value of $\beta$ for
BucketBound ($\epsilon = 0.5$). Note that the runtime of Greedy is not
affected by $\alpha$.
Varying the number of query keywords. Figure 4 shows the runtime
of the four algorithms on the Flickr dataset when we vary the
number of query keywords. For each number, we report the average
runtime over five runs, each using a different $\Delta$, namely 3, 6,
9, 12, and 15 kilometers, respectively. Note that the y-axis is in
logarithmic scale. We can see that all the algorithms are reasonably
efficient on this dataset. As expected, the algorithm OSScaling
runs much slower than the other three algorithms. BucketBound
is usually 8-10 times faster than OSScaling, although OSScaling
and BucketBound have the same worst-case time complexity. This
is because BucketBound terminates immediately when a feasible
route is found in the bucket containing $R_{OS}$, the route found
by OSScaling, and thus it generates far fewer labels than
OSScaling does. The worst-case time complexity of both OSScaling and
BucketBound is exponential in the number of query keywords.
However, as shown in the experiment, the runtime does not increase
dramatically as the number of query keywords is increased. This
is due to the two optimization strategies employed in both algorithms.
Without employing the optimization strategies, both algorithms
will be 3-5 times slower. Due to space limitations, we omit
the details.



Greedy-1 is the fastest since it only selects the best node in each
step. However, as will be shown, its accuracy is the worst. Greedy-1
is not affected significantly by the number of query keywords.
The runtime of Greedy-2 increases dramatically with the increase
of query keywords. This is because Greedy-2 selects the best 2
nodes at each step, and its asymptotically tight complexity bound
is exponential in the number of query keywords.

Varying the budget limit $\Delta$. Figure 5 shows the runtime of the
four approaches on the Flickr dataset with the variation of $\Delta$. At
each $\Delta$, the average runtime is reported over 5 runs, each with a
different number of query keywords from 2 to 10. The runtime
of OSScaling grows when $\Delta$ increases from 3 km to 6 km as a
smaller $\Delta$ can prune more routes. However, as $\Delta$ continues to
increase, the runtime decreases slightly. This is due to the fact
that with a larger $\Delta$, OSScaling finds a feasible solution earlier
(since $\Delta$ is more likely to be satisfied), and then the feasible
solution can be used to prune the subsequent search space. The saving
dominates the extra cost incurred by using a larger $\Delta$ (notice that
a larger $\Delta$ deteriorates the worst-case performance rather than the
average performance). As for the other approximation algorithms,
their runtime is almost not affected by the budget limit as shown in
the figure.
Figure 6: Runtime



Figure 7: Relative Ratio 



Varying the parameter $\epsilon$ for OSScaling. Figure 6 shows the runtime
of OSScaling when we vary the value of $\epsilon$. We set $\Delta$ as 6
km and the number of query keywords as 6. It is observed that
OSScaling runs faster as the value of $\epsilon$ increases. This is because
when $\epsilon$ becomes larger, $L_{max}$, the upper bound of the number of
labels on a node, is decreased, and thus more labels (representing
partial routes) can be pruned during the algorithm. This is consistent
with the complexity analysis of OSScaling, which shows that
OSScaling runs linearly in $\frac{1}{\epsilon}$.
Figure 8: Runtime



Figure 9: Relative Ratio

Varying the parameter $\beta$ for BucketBound. Figure 8 shows the
runtime of BucketBound when we vary the value of $\beta$, the specified
parameter. In this set of experiments, $\Delta = 6$ km, $\epsilon = 0.5$, and
the number of query keywords is 6. As expected, BucketBound
runs faster as the value of $\beta$ increases. This is because when $\beta$
becomes larger, the interval of each bucket becomes larger and
each bucket can accommodate more labels. Hence, it is faster for
BucketBound to find a feasible solution in the bucket containing
the best route in $G_S$.
4.2.2 Accuracy of Approximation Algorithms

The purpose of this set of experiments is to study the accuracy of
the approximation algorithms. The brute-force method discussed
in Section 3.2 failed to finish for most settings after more than
1 day. We note that in the very few successful cases (small $\Delta$ and
few keywords), the practical approximation ratios of OSScaling and
BucketBound are a lot smaller than their theoretical bounds, compared
with the exact results by the brute-force method. To make
the experiments tractable, we study the relative approximation ratio.
We use the result of OSScaling with $\epsilon = 0.1$ (which has the
smallest approximation ratio among the proposed methods) as the base
and compare the relative performance of the other algorithms with
it. We compute the relative ratio of an algorithm over OSScaling
with $\epsilon = 0.1$ as follows: For each query, we compute the ratio of the
objective score of the route found by the algorithm to the score of
the route found by OSScaling with $\epsilon = 0.1$, and the average ratio
over all queries is finally reported as the measure.

With the measure, we study the effect of the following parameters
on accuracy, namely the number of query keywords, the budget
limit $\Delta$, the scaling parameter $\epsilon$ in OSScaling, the specified
parameter $\beta$ in BucketBound, and the parameter $\alpha$ which balances
the importance of the objective and budget scores during the node
selection, for Greedy.
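The relative ratio described above can be computed as in the following sketch. The function name `relative_ratio` and its arguments are hypothetical; the inputs are assumed to be per-query objective scores of the evaluated algorithm and of the baseline (OSScaling with the scaling parameter set to 0.1), listed in the same query order.

```python
from typing import Sequence


def relative_ratio(scores: Sequence[float],
                   baseline_scores: Sequence[float]) -> float:
    """Average, over all queries, of score / baseline score."""
    if len(scores) != len(baseline_scores) or not scores:
        raise ValueError("need one score per query for both algorithms")
    ratios = [s / b for s, b in zip(scores, baseline_scores)]
    return sum(ratios) / len(ratios)
```

For two queries with scores 2.0 and 3.0 against baseline scores 2.0 and 2.0, the per-query ratios are 1.0 and 1.5, so the reported measure is 1.25.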



Figure 10: Relative Ratio    Figure 11: Relative Ratio

Varying the number of query keywords or $\Delta$. Figure 10 shows
the relative ratio compared with the results of OSScaling with
$\epsilon = 0.1$ for the experiment in Figure 4, in which we vary the number
of query keywords. Figure 11 shows the relative ratio for the
experiment in Figure 5, in which we vary the value of the budget limit
$\Delta$. Note that $\epsilon = 0.5$ and $\beta = 1.2$ in the two experiments.

Since the greedy algorithms fail to find a feasible solution on
about 10%-20% of the queries, for the greedy algorithms we measure the
relative ratio only on the queries where Greedy-1 and Greedy-2 are
able to find feasible routes. For OSScaling and BucketBound, the
reported results are based on all queries, and they are similar to the
results obtained if we only use the set of queries for which Greedy returns
feasible solutions. We observe that the relative ratio of BucketBound
compared with the results of OSScaling is always below the specified
parameter $\beta$. It can also be observed that BucketBound
achieves much better accuracy than Greedy-1 and Greedy-2,
especially when the number of query keywords or the value of $\Delta$
is large.

Varying the parameter $\epsilon$ for OSScaling. Figure 7 shows the effect
of $\epsilon$ on the relative ratio in OSScaling. We set $\Delta$ as 6 kilometers
and the number of query keywords at 6. We can observe that
the relative ratio becomes worse as we increase $\epsilon$, which is consistent
with the result of Theorem 2, i.e., the performance bound of
OSScaling is $\frac{1}{1-\epsilon}$.

Varying the parameter $\beta$ for BucketBound. Figure 9 shows the
effect of $\beta$ on the relative ratio in BucketBound, while the corresponding
runtime is reported in Figure 8, where we set $\epsilon = 0.5$, $\Delta = 6$
km, and the number of query keywords as 6. As expected, the relative
ratio becomes worse as we increase $\beta$. Note that the relative ratio
of BucketBound compared to the results of OSScaling is consistently
smaller than the specified $\beta$.




Figure 12: Relative Ratio Figure 13: Failure Percentage 

Varying the parameter $\alpha$ for Greedy. Figure 12 shows the relative
ratio of Greedy-1 and Greedy-2 compared with the results
of OSScaling when we vary $\alpha$, and Figure 13 shows the percentage
of failed queries. In this set of experiments, we set $\Delta$
as 6 kilometers, and the average performance is reported over 5
runs, each with a different number of query keywords from 2 to
10. Note that the relative ratio is computed based on the set of
queries where Greedy-1 and Greedy-2 are able to find feasible
routes over the set of queries with feasible solutions (OSScaling
and BucketBound are guaranteed to return feasible results if any exist). We
observe that as the value of $\alpha$ increases the relative ratio becomes
worse for both Greedy-1 and Greedy-2, but they succeed in finding
feasible routes for more queries. When $\alpha$ is set as 0, which
means that the objective score is the only criterion when selecting
the node in each step of Greedy, both Greedy-1 and Greedy-2
achieve the best average ratio while the failure percentage is the
largest. When $\alpha = 1$, the next best node is selected merely based
on the budget score. Hence, Greedy is able to find feasible results
on more queries, but the relative accuracy becomes much
worse on the queries for which Greedy is able to return feasible
solutions. Greedy-2 outperforms Greedy-1 consistently, because
more routes are checked in Greedy-2 and it is likely to find more
feasible and better routes.

4.2.3 Comparing OSScaling and BucketBound 
Figure 14: Runtime

The aim of this set of experiments is to compare the performance
of OSScaling and BucketBound when they have the same theoretical
approximation ratio. In this set of experiments, $\Delta = 6$ km,
$\beta = 1.2$, and the number of query keywords is 6. The values of $\epsilon$
are computed according to different performance bounds for both
algorithms. Figures 14 and 15 show the runtime and relative ratio
of OSScaling and BucketBound when we vary the performance
bound, respectively. We observe that BucketBound runs consistently
faster than OSScaling over all performance bounds while
OSScaling always achieves a better relative ratio.
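The paper does not spell out how $\epsilon$ is derived from a target performance bound. One plausible reading, given here only as a hedged sketch, assumes the bound of OSScaling is $\frac{1}{1-\epsilon}$ (Theorem 2) and the bound of BucketBound is $\frac{\beta}{1-\epsilon}$ (Theorem 3), and solves each for $\epsilon$. The function names are hypothetical.

```python
def epsilon_for_osscaling(bound: float) -> float:
    """Solve 1 / (1 - eps) = bound for eps (assumed bound of OSScaling)."""
    if bound <= 1.0:
        raise ValueError("the bound must exceed 1")
    return 1.0 - 1.0 / bound


def epsilon_for_bucketbound(bound: float, beta: float) -> float:
    """Solve beta / (1 - eps) = bound for eps (assumed bound of BucketBound)."""
    if bound <= beta:
        raise ValueError("the bound must exceed beta")
    return 1.0 - beta / bound
```

Under these assumptions, a common bound of 2.4 with $\beta = 1.2$ would correspond to $\epsilon \approx 0.58$ for OSScaling and $\epsilon = 0.5$ for BucketBound, so BucketBound needs a finer scaling to match the same guarantee.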

4.2.4 Performance of Algorithms for KkR

We study the performance of the modified versions of the two
approximation algorithms, i.e., OSScaling and BucketBound, for
processing KkR. We set $\epsilon = 0.5$, $\beta = 1.2$, $\Delta = 6$ km, and the average
runtime is reported over 5 runs, each with a different number
of query keywords from 2 to 10. The results are shown in Figure 16.
BucketBound always outperforms OSScaling in terms of
runtime. As expected, both algorithms run slower as we increase
the value of k. In OSScaling, more labels need to be generated for
larger k, which leads to longer runtime. Algorithm BucketBound
terminates only after the top-k feasible routes are found, thus needing
longer query time.

4.2.5 Experiments on More Datasets 
Figure 18: Runtime 
Figure 19: Runtime 

We also conduct experiments on the synthetic dataset containing 
5,000 nodes. Figures 18 and 19 show the runtime when we vary the 
number of query keywords and the value of Δ, respectively. We 
set ε to 0.5 and β to 1.2. The comparison results are consistent 
with those on the Flickr dataset. For the relative ratio, we observe 
qualitatively similar results on this dataset as we do on Flickr. We 
omit the results due to space limitations. 

4.2.6 Scalability 

Figure 17 shows the runtime of the proposed algorithms (the 
number of query keywords is 6 and Δ=30 km). They all scale 
well with the size of the dataset. The relative ratio changes only 
slightly; we omit the details due to space limitations. 

4.2.7 Example 
Figure 20: Example Route 1 
Figure 21: Example Route 2 

We use one example found in the Flickr dataset to show that 
KOR is able to find routes according to users' various preferences. 
We set the starting location at DeWitt Clinton Park and the 
destination at the United Nations Headquarters, and the query keywords 
are "jazz", "imax", "vegetation", and "Cappuccino", i.e., a user 
would like to find a route such that he can listen to jazz music, 
watch a movie, eat vegetarian food and have a cup of Cappuccino. 
When we set the distance threshold Δ to 9 km, the route shown 
in Figure 20 is returned by OSScaling as the most popular route 
that covers all query keywords and satisfies the distance threshold. We 
find that according to the historical trips, this route has the most 
visitors among all routes that cover all the query keywords and are 
shorter than 9 km. However, when Δ is set to 6 km, the route shown in 
Figure 21 is returned. This route has the most visitors among all 
feasible routes given Δ=6 km. In this case, the route in Figure 20 
exceeds the limit Δ=6 km and is pruned during the execution of 
the OSScaling algorithm. 
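
To make the effect of the threshold concrete, the following is a minimal brute-force sketch on a hypothetical toy graph. It is not OSScaling. It simply enumerates simple paths, discards partial routes whose budget score exceeds Δ, and keeps the keyword-covering route with the smallest objective score. The graph, the keyword sets, the function name `kor_brute_force`, the restriction to simple paths, and the convention that a lower objective score is better are all assumptions made for illustration.

```python
def kor_brute_force(graph, keywords_at, source, target, query, delta):
    """Exhaustively search simple paths from source to target.

    graph: {u: {v: (objective_score, budget_score)}} for directed edges u -> v.
    keywords_at: {node: set of keywords found at that node}.
    query: set of keywords the route must cover.
    delta: budget limit on the total budget score.
    Returns (objective_score, path) for the best feasible path, or None.
    """
    best = None

    def dfs(node, path, obj, bud, covered):
        nonlocal best
        if bud > delta:  # budget exceeded: prune this partial route
            return
        if node == target:
            # Feasible only if every query keyword has been covered.
            if query <= covered and (best is None or obj < best[0]):
                best = (obj, list(path))
            return
        for nxt, (o, b) in graph.get(node, {}).items():
            if nxt in path:  # keep paths simple (assumption of this sketch)
                continue
            path.append(nxt)
            dfs(nxt, path, obj + o, bud + b,
                covered | keywords_at.get(nxt, set()))
            path.pop()

    dfs(source, [source], 0, 0, set(keywords_at.get(source, set())))
    return best


# Hypothetical data: edge labels are (objective score, budget score in km).
graph = {
    "s": {"a": (1, 4), "b": (2, 2)},
    "a": {"t": (1, 4)},
    "b": {"c": (1, 2)},
    "c": {"t": (2, 2)},
}
keywords_at = {"a": {"jazz", "imax"}, "b": {"jazz"}, "c": {"imax"}}
query = {"jazz", "imax"}
```

With a limit of 9 the sketch returns the route through `a`, which has the better objective score. With a limit of 6 that route is pruned and the route through `b` and `c` is returned instead, mirroring the change from Figure 20 to Figure 21.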



5. RELATED WORK 

Travel route search: The travel route search problem has received 
a lot of attention. Li et al. [17] propose a new query called Trip 
Planning Query (TPQ) in spatial databases, in which each spatial 
object has a location and a category, and the objects are indexed 
by an R-tree. A TPQ has three components: a start location s, an 
end location t, and a set of categories C, and it is to find the short- 
est route that starts at s, passes through at least one object from 
each category in C and ends at t. It is shown that TPQ can be 
reduced from the Traveling Salesman problem, which is NP-hard. 
Based on the triangle inequality property of metric space, two ap- 
proximation algorithms including a greedy algorithm and an inte- 
ger programming algorithm are proposed. Compared with TPQ, 
KOR studied in this paper includes an additional constraint (the 
budget constraint), and thus is more expressive. The algorithms in 
the work [17] cannot be used to process KOR. 

Sharifzadeh et al. [22] study a variant problem of TPQ [17], 
called optimal sequenced route query (OSR). In OSR, a total or- 
der on the categories C is imposed and only the starting location s 
is specified. The authors propose two elegant exact algorithms, 
L-LORD and R-LORD. Under the same setting as [17], in which objects are 
stored in spatial databases and indexed by an R-tree, metric space 
based pruning strategies are developed in the two exact algorithms. 

Chen et al. [3] consider the multi-rule partial sequenced route 
(MRPSR) query, which is a unified query of TPQ and OSR. Three 
heuristic algorithms are proposed to answer MRPSR. KOR is 
different from OSR and MRPSR, and their algorithms are not 
applicable to processing KOR. 

Kanza et al. [14] consider a different route search query on the 
spatial database: the length of the route should be smaller than a 
specified threshold while the total text relevance of this route is 
maximized. A greedy algorithm is proposed, but it is not guaranteed to 
find a feasible route. Their subsequent work [12] develops several 
heuristic algorithms for answering a similar query in an interac- 
tive way. After visiting each object, the user provides feedback on 
whether the object satisfies the query, and the feedback is consid- 
ered when computing the next object to be visited. In the work [16], 
approximate algorithms for solving OSR [22] in the presence of 
order constraints in an interactive way are developed. Kanza et 
al. also study the problem of searching for optimal sequenced routes in 
probabilistic spatial databases [13]. Lu et al. [18] consider the same 
query as [14] and propose a data mining-based approach. The queries 
considered in these works are different from KOR, and these 
algorithms cannot be used to answer KOR. 

Malviya et al. [20] tackle the problem of answering continu- 
ous route planning queries over a road network. The route plan- 
ning [20] aims to find the shortest path in the presence of updates 
to the delay estimates. Roy et al. [21] consider the problem of 
interactive trip planning, in which the users give feedback on the 
already suggested points of interest, and the itineraries are 
constructed iteratively based on the users' preferences and time budget. 
These two problems are clearly different from KOR. 

Yao et al. [26] propose the multi-approximate-keyword routing 
(MARK) query. A MARK query is specified by a starting and an 
ending location, and a set of (keyword, threshold) value pairs. It 
searches for the route with the shortest length such that it covers 
at least one matching object per keyword with the similarity larger 
than the corresponding threshold value. The aims of MARK are thus 
different from those of the KOR query. 

The collective spatial keyword search [2] is related to our 
problem, where a group of objects that are close to a query point and 
collectively cover a set of query keywords are returned as the 
result. However, the KOR query requires a route satisfying a budget 
constraint rather than a set of independent locations. Our problem 
is also relevant to the spatial keyword search queries [1,6] where 
both spatial and textual features are taken into account during the 
query processing. However, they retrieve single objects while the 
KOR query finds a route. 

Travel route recommendation: Recent works on travel route rec- 
ommendation aim to recommend routes to users based on users' 
travel histories. Lu et al. [19] collect geo-tagged photos from Flickr 
and build travel routes from them. They define popularity scores 
on each location and each trip, and recommend a route that has the 
largest popularity score within a travel duration in the whole dataset 
for a city. The recommendation in this work is not formulated as 
queries, and the recommendation algorithm has an extremely long 
running time. The work [4] finds popular routes from users' historical 
trajectories. The popularity score is defined as the probability of moving from the 
source location to the target location, estimated using the absorbing 
Markov model based on the trajectories. Yoon et al. [27] propose a 
smart recommendation, based on multiple user-generated GPS tra- 
jectories, to efficiently find itineraries. The work [15] predicts the 
subsequent routes according to the user's current trajectory and pre- 
vious trajectory history. None of these proposals takes into account 
the keywords as we do in this work. 

6. CONCLUSION AND FUTURE WORK 

In this paper, we define the problem of keyword-aware opti- 
mal route query, denoted by KOR, which is to find an optimal 
route such that it covers a set of user-specified keywords, a spec- 
ified budget constraint is satisfied, and the objective score of the 
route is optimized. The problem of answering KOR queries is 
NP-hard. We devise two approximation algorithms, i.e., OSScaling 
and BucketBound, with provable approximation bounds for this 
problem. We also design a greedy approximation algorithm. 
Results of empirical studies show that all the proposed algorithms are 
capable of answering KOR queries efficiently, while the algorithms 
BucketBound and Greedy run faster. We also study the accuracy 
of the approximation algorithms. 

In future work, we would like to improve the current 
pre-processing approach. We can employ a graph partitioning algorithm 
to divide a large graph into several subgraphs. We then perform the 
pre-processing only within each subgraph instead of on the whole graph. 
We also compute and store the best objective and budget score be- 
tween every pair of border nodes. Thus, the path with the best 
objective or budget score can be obtained from the pre-processing 
results. We believe that this approach can greatly reduce the time 
and space costs of the pre-processing. 
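
As a rough illustration of the first step of this idea, the sketch below identifies the border nodes of each subgraph once a partition is available. It assumes the common definition that a node is a border node if at least one of its edges connects it to a node in a different subgraph. The function name `border_nodes` and the `part_of` mapping (assumed to be produced by some graph partitioner) are hypothetical and are not part of the proposed algorithms.

```python
def border_nodes(edges, part_of):
    """Return {partition_id: set of border nodes}.

    edges: iterable of (u, v) pairs.
    part_of: {node: partition_id}, assumed to come from a graph partitioner.
    A node is treated as a border node if one of its edges crosses partitions.
    """
    borders = {}
    for u, v in edges:
        if part_of[u] != part_of[v]:
            borders.setdefault(part_of[u], set()).add(u)
            borders.setdefault(part_of[v], set()).add(v)
    return borders


# Hypothetical path graph 1-2-3-4-5 split into two subgraphs.
edges = [(1, 2), (2, 3), (3, 4), (4, 5)]
part_of = {1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
```

Under this assumption, the best objective and budget scores would need to be stored only within each subgraph and between pairs of border nodes, rather than between all pairs of nodes in the whole graph.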